On choosing a mixture model for clustering
Author
Abstract
Two methods for clustering data and choosing a mixture model are proposed. First, we derive a new classification algorithm based on the classification likelihood. Then, the likelihood conditional on these clusters is written as the product of the likelihoods of each cluster, and AIC- and BIC-type approximations, respectively, are applied. The resulting criteria turn out to be the sum of the AIC or BIC relative to each cluster plus an entropy term. The performance of our methods is evaluated by Monte Carlo methods and on a real data set, showing in particular that the iterative estimation algorithm generally converges quickly, so the computational load is rather low.
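The abstract does not give the criterion's formula, so the following is only a minimal sketch of one plausible reading of "the sum of the AIC or BIC relative to each cluster plus an entropy term". It assumes 1-D Gaussian clusters, a hard partition that is already given, and mixing proportions estimated as n_k / n; the function names `gaussian_loglik` and `per_cluster_criterion` are hypothetical and not taken from the paper.

```python
import numpy as np


def gaussian_loglik(x):
    """Maximized log-likelihood of a 1-D Gaussian fitted to x (MLE mean and variance)."""
    n = x.size
    var = max(x.var(), 1e-12)  # guard against a zero-variance cluster
    return -0.5 * n * (np.log(2.0 * np.pi * var) + 1.0)


def per_cluster_criterion(x, labels, kind="bic"):
    """Sum of per-cluster AIC/BIC values plus an entropy-like term.

    Assumption (not from the source): each cluster is a 1-D Gaussian with
    two free parameters, and the entropy term is -2 * sum_k n_k * log(n_k / n).
    Lower values are better.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n = x.size
    n_params = 2  # mean and variance of each cluster
    total = 0.0
    entropy = 0.0
    for k in np.unique(labels):
        xk = x[labels == k]
        nk = xk.size
        # BIC penalizes with the cluster's own sample size, AIC with a constant
        penalty = n_params * np.log(nk) if kind == "bic" else 2.0 * n_params
        total += -2.0 * gaussian_loglik(xk) + penalty
        entropy += -2.0 * nk * np.log(nk / n)
    return total + entropy


# Usage: compare a two-cluster partition with a single cluster
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-5.0, 1.0, 100), rng.normal(5.0, 1.0, 100)])
two_clusters = np.repeat([0, 1], 100)
one_cluster = np.zeros(200, dtype=int)
score_two = per_cluster_criterion(data, two_clusters)
score_one = per_cluster_criterion(data, one_cluster)
```

Under these assumptions, the partition with the lower score would be preferred; on well-separated data such as the example above, the two-cluster partition should score lower than the single cluster.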
Similar resources
On Model-Based Clustering, Classification, and Discriminant Analysis
The use of mixture models for clustering and classification has burgeoned into an important subfield of multivariate analysis. These approaches have been around for a half-century or so, with significant activity in the area over the past decade. The primary focus of this paper is to review work in model-based clustering, classification, and discriminant analysis, with particular attenti...
Extracting Prior Knowledge from Data Distribution to Migrate from Blind to Semi-Supervised Clustering
Although many studies have been conducted to improve the clustering efficiency, most of the state-of-art schemes suffer from the lack of robustness and stability. This paper is aimed at proposing an efficient approach to elicit prior knowledge in terms of must-link and cannot-link from the estimated distribution of raw data in order to convert a blind clustering problem into a semi-supervised o...
Clustering with multiple distance metrics - mixture models with profile transformations
Clustering methods often require the selection of a distance metric; how do we define data objects as ’close’ enough to be grouped together, or ’far’ enough apart to be separated? Choosing an appropriate distance metric is not always easy. We consider high-dimensional gene expression data as an example. The shape of a gene’s expression profile across experimental conditions is often considered ...
Clustering Based on a Multi-layer Mixture Model
In model-based clustering, the density of each cluster is usually assumed to be a certain basic parametric distribution, e.g., the normal distribution. In practice, it is often difficult to decide which parametric distribution is suitable to characterize a cluster, especially for multivariate data. Moreover, the densities of individual clusters may be multi-modal themselves, and therefore canno...
Improving SimPoint accuracy for small simulation budgets with EDCM clustering
Detailed processor simulation is extremely costly on large benchmark suites, where each program may run for billions of instructions and take months of simulation time. We can obtain good approximate answers in less time using limited simulation, but deciding which regions to simulate is a difficult problem. SimPoint is one approach for choosing simulation regions, based on the k-means clusteri...